route EthosU input/output memcpy through overridable hook (#19264) by 3l1 · Pull Request #19264 · pytorch/executorch

3l1 · 2026-05-01T21:06:05Z

Summary:

The EthosU backend's input/output scratch shuffling currently does a plain
CPU std::memcpy of every input tensor into the scratch buffer and every
output tensor out of it on every inference. On Cortex-M55-based firmware
targets that have a DMA engine, this is a significant CPU load: inference
time is spent in memcpy that could instead be DMA-offloaded so the M55
sleeps while the transfer runs.

This change introduces a thin extern-C indirection — arm_ethos_io_memcpy
— that the EthosU backend uses everywhere it currently calls memcpy for
input/output scratch shuffling. The default (weak) implementation lives
in a separate translation unit (EthosUBackend_IoMemcpy.cpp) and just
calls std::memcpy, so behavior is unchanged for any consumer that doesn't
override it.
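
A minimal sketch of what the weak default and one call site might look like. It assumes the signature shown in the review excerpt (`void* dst, const void* src, size_t size`) and the GCC/Clang `__attribute__((weak))` extension. The helper `copy_input_to_scratch` is hypothetical and only stands in for the backend's scratch copy. Both pieces share one file here so the sketch is self-contained; in the layout described above the default sits in its own translation unit.

```cpp
// Sketch of EthosUBackend_IoMemcpy.cpp: weak default for the I/O copy hook.
#include <cstddef>
#include <cstring>

// Weak definition: if the final image links a strong definition of the same
// symbol, the linker picks that one and this body is dropped.
extern "C" __attribute__((weak)) void
arm_ethos_io_memcpy(void* dst, const void* src, std::size_t size) {
  std::memcpy(dst, src, size);  // unchanged behavior when nobody overrides
}

// Hypothetical call site (not the backend's real code). In the real layout
// this would live in another TU and see only the hook's declaration.
void copy_input_to_scratch(char* scratch, const char* input,
                           std::size_t nbytes) {
  arm_ethos_io_memcpy(scratch, input, nbytes);
}
```

With only this file linked, the call resolves to the weak body and the copy behaves like a plain `std::memcpy`.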

Firmware targets can supply a strong-symbol override (e.g. routing
through a DMA engine) without touching the upstream backend code.
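
As an illustration only, a firmware-side override could look roughly like the sketch below. Everything except the hook name is an assumption: `dma_copy_blocking` is a hypothetical driver call (stubbed with `std::memcpy` so the sketch runs on a host), and `kDmaMinBytes` is an invented threshold below which the CPU copy is kept. The hook signature follows the review excerpt in this PR.

```cpp
// Sketch of a firmware-side strong override (all names except the hook are
// hypothetical).
#include <cassert>
#include <cstddef>
#include <cstring>

// Counts transfers so the sketch can be checked on a host.
static std::size_t g_dma_transfers = 0;

// Hypothetical blocking DMA driver call. A real one would program the DMA
// engine, handle any required cache maintenance, and let the core sleep
// until the completion interrupt. Here it is stubbed with memcpy.
static bool dma_copy_blocking(void* dst, const void* src, std::size_t size) {
  std::memcpy(dst, src, size);
  ++g_dma_transfers;
  return true;
}

// Assumed tuning knob: small copies stay on the CPU.
constexpr std::size_t kDmaMinBytes = 256;

// Strong (non-weak) definition: replaces the backend's weak default at
// link time without any change to the backend sources.
extern "C" void arm_ethos_io_memcpy(void* dst, const void* src,
                                    std::size_t size) {
  assert(size == 0 || (dst != nullptr && src != nullptr));
  if (size >= kDmaMinBytes && dma_copy_blocking(dst, src, size)) {
    return;
  }
  std::memcpy(dst, src, size);  // fallback: same result as the default
}
```

Because only the EthosU input/output copies go through this symbol, such an override leaves every other `memcpy` in the image untouched.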

Implementation notes:

- The weak default lives in its own TU so the compiler in the call-site
  TUs cannot inline its body and bypass the link-time override. This is
  the same pattern bolt_arm_memcpy_external uses.
- Three call sites updated: input scratch copy in EthosUBackend.cpp, the
  layout-adjustment chunk loop in EthosUBackend.cpp, and the output
  scratch copy in EthosUBackend_Cortex_M.cpp.

bypass-github-export-checks
bypass-github-pytorch-ci-checks
bypass-github-executorch-ci-checks

Reviewed By: rascani

Differential Revision: D103455766

pytorch-bot · 2026-05-01T21:06:09Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19264

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 6 New Failures, 20 Cancelled Jobs, 14 Pending, 7 Unrelated Failures

As of commit b6d333d with merge base cdcc915:

NEW FAILURES - The following jobs have failed:

Apple / build-frameworks-ios / macos-job (gh)
RuntimeError: Command bash /Users/runner/work/_temp/exec_script failed with exit code 1
pull / test-coreml-bc-macos (macos-m1-stable) / macos-job (gh)
The process '/usr/bin/git' failed with exit code 1
pull / unittest / macos / macos-job (gh)
The process '/usr/bin/git' failed with exit code 1
pull / unittest-editable / macos / macos-job (gh)
The process '/usr/bin/git' failed with exit code 1
trunk / test-models-macos-cpu (mv3, xnnpack-quantization-delegation) / macos-job (gh)
The process '/usr/bin/git' failed with exit code 128
trunk / test-models-macos-cpu (w2l, xnnpack-quantization-delegation) / macos-job (gh)
The process '/usr/bin/git' failed with exit code 1

CANCELLED JOBS - The following jobs were cancelled. Please retry:

pull / test-lora-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-lora-multimethod-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (ic4, portable, linux.4xlarge.memory) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (ic4, xnnpack-quantization-delegation, linux.4xlarge.memory) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (mobilebert, portable, linux.2xlarge) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (mobilebert, xnnpack-quantization-delegation, linux.2xlarge) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (phi_4_mini, portable, linux.4xlarge.memory) / linux-job (gh)
##[error]The operation was canceled.
pull / test-moshi-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-phi-3-mini-runner-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-samsung-quantmodels-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-sqnr-static-llm-qnn-linux (smollm2_135m) / linux-job (gh)
##[error]The operation was canceled.
pull / test-vulkan-models-linux / linux-job (gh)
##[error]The operation was canceled.
pull / unittest / linux / linux-job (gh)
##[error]The operation was canceled.
pull / unittest-editable / linux / linux-job (gh)
##[error]The operation was canceled.
trunk / test-arm-ootb-linux (run_deit_e2e_ethos_u) / linux-job (gh)
##[error]The operation was canceled.
trunk / test-huggingface-transformers-xnnpack (qwen3-0.6b|xnnpack|--quantize) / linux-job (gh)
##[error]The operation was canceled.
trunk / test-huggingface-transformers-xnnpack (qwen3-1.7b|xnnpack|--quantize) / linux-job (gh)
##[error]The operation was canceled.
trunk / test-huggingface-transformers-xnnpack (smollm2-135m|xnnpack|--quantize) / linux-job (gh)
##[error]The operation was canceled.
trunk / test-huggingface-transformers-xnnpack (smollm3-3b|xnnpack|--quantize) / linux-job (gh)
##[error]The operation was canceled.
trunk / unittest-release / linux / linux-job (gh)
##[error]The operation was canceled.

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

pull / test-voxtral-realtime-xnnpack-linux / linux-job (gh) (similar failure)
pull / unittest / windows / windows-job (gh) (matched win rule in flaky-rules.json)
##[error]The operation was canceled.
trunk / test-models-windows (mobilebert, portable) / windows-job (gh) (similar failure)
trunk / test-models-windows (mobilebert, xnnpack-q8) / windows-job (gh) (similar failure)
trunk / unittest-release / windows / windows-job (gh) (similar failure)

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / unittest-editable / windows / windows-job (gh) (trunk failure)
trunk / unittest-release / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 8

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync · 2026-05-01T21:06:13Z

@3l1 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103455766.

github-actions · 2026-05-01T21:07:00Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

zingo

Nice idea, like it!


3l1 · 2026-05-06T01:24:08Z

⚠️ NOTE: many tests are failing; looking into it (I suspect a missing inclusion in some build script).

digantdesai · 2026-05-06T02:53:01Z

+// unit so the compiler in the call-site TUs cannot inline this body and
+// bypass the link-time override (same trick as bolt_arm_memcpy_external).
+extern "C" __attribute__((weak)) void
+io_memcpy(void* dst, const void* src, size_t size) {


Regular memcpy should already be weak for embedded toolchains, or we may be able to override it through compiler flags, but this is also OK.

Note that we do not want a wide override across subsystems or modules. E.g., we enable this DMA override on specific Zephyr overlay configs for specific app_versions only; i.e., we want only this specific workload (copying tensors back and forth for the NPU) to be offloaded to hardware DMA, since it also has its tradeoffs.


zingo · 2026-05-06T11:36:20Z

Cortex-M testing:
EthosUBackend.cpp:(.text._ZNK10executorch8backends3arm13EthosUBackend7executeERNS_7runtime23BackendExecutionContextEPvNS3_4SpanIPNS3_6EValueEEE[_ZNK10executorch8backends3arm13EthosUBackend7executeERNS_7runtime23BackendExecutionContextEPvNS3_4SpanIPNS3_6EValueEEE]+0xd2): undefined reference to arm_ethos_io_memcpy'`

This was an interesting side effect. It seems we are building the backend here when we probably should (or could) avoid it.


3l1 requested a review from digantdesai as a code owner May 1, 2026 21:06

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 1, 2026

github-actions Bot added ciflow/trunk module: arm Issues related to arm backend labels May 1, 2026

meta-codesync Bot added fb-exported meta-exported labels May 1, 2026

meta-codesync Bot force-pushed the export-D103455766 branch from ddea8da to ffc9927 Compare May 1, 2026 21:07

3l1 added the partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm label May 1, 2026

3l1 requested a review from gggekov May 1, 2026 21:10

zingo approved these changes May 4, 2026

View reviewed changes

Comment thread backends/arm/runtime/EthosUBackend_IoMemcpy.cpp

rascani approved these changes May 4, 2026

View reviewed changes

meta-codesync Bot changed the title ~~route EthosU input/output memcpy through overridable hook~~ route EthosU input/output memcpy through overridable hook (#19264) May 5, 2026

meta-codesync Bot force-pushed the export-D103455766 branch from ffc9927 to 8eeb57c Compare May 5, 2026 22:00

meta-codesync Bot force-pushed the export-D103455766 branch from 8eeb57c to 3fe2220 Compare May 5, 2026 23:59

meta-codesync Bot force-pushed the export-D103455766 branch from 3fe2220 to 845995e Compare May 6, 2026 00:13

digantdesai reviewed May 6, 2026

View reviewed changes

digantdesai approved these changes May 6, 2026

View reviewed changes

meta-codesync Bot force-pushed the export-D103455766 branch from 845995e to efec58a Compare May 6, 2026 06:09

meta-codesync Bot force-pushed the export-D103455766 branch from efec58a to c4b0f13 Compare May 6, 2026 18:13

meta-codesync Bot requested a review from larryliu0820 as a code owner May 6, 2026 18:13

meta-codesync Bot requested a review from kirklandsign as a code owner May 6, 2026 18:13

3l1 force-pushed the export-D103455766 branch from c4b0f13 to eb64ab4 Compare May 6, 2026 18:25

meta-codesync Bot force-pushed the export-D103455766 branch 2 times, most recently from 00b91bc to b6d333d Compare May 6, 2026 18:37

meta-codesync Bot merged commit af90130 into main May 6, 2026
385 of 454 checks passed

meta-codesync Bot deleted the export-D103455766 branch May 6, 2026 23:41
